feat: add JSoup HTML document reader #2245

apappascs · 2025-02-14T15:54:30Z

This commit introduces the JsoupDocumentReader and JsoupDocumentReaderConfig classes, which provide functionality to read and parse HTML documents using the JSoup library.

The reader supports:

- Extracting text from specific HTML elements using CSS selectors.
- Extracting all text from the body of the document.
- Grouping text by element.
- Extracting metadata, including the document title, meta tags, and link URLs.
- Reading from various resource types (files, URLs, byte arrays).
- Configurable character encoding, selector, separator, and metadata extraction.
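
To make the difference between the joined-text mode and the group-by-element mode concrete, here is a minimal, dependency-free sketch. Everything in it is hypothetical: `ReaderConfigSketch`, its `Config` class, the `with...` methods, and the default values are illustrative assumptions and are not the API of `JsoupDocumentReaderConfig` added in this PR. The `elementTexts` argument stands in for the text of each element that a CSS selector would match; the real reader obtains those from JSoup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for the reader's options; NOT the PR's actual classes.
public class ReaderConfigSketch {

    // Immutable options object; the defaults below are assumptions for illustration.
    public static final class Config {
        public final Charset charset;
        public final String selector;
        public final String separator;
        public final boolean groupByElement;

        Config(Charset charset, String selector, String separator, boolean groupByElement) {
            this.charset = charset;
            this.selector = selector;
            this.separator = separator;
            this.groupByElement = groupByElement;
        }

        public static Config defaults() {
            return new Config(StandardCharsets.UTF_8, "body", "\n", false);
        }

        public Config withSelector(String newSelector) {
            return new Config(charset, newSelector, separator, groupByElement);
        }

        public Config withSeparator(String newSeparator) {
            return new Config(charset, selector, newSeparator, groupByElement);
        }

        public Config withGroupByElement(boolean group) {
            return new Config(charset, selector, separator, group);
        }
    }

    // elementTexts: the text of each element matched by config.selector (assumed input).
    public static List<String> toDocuments(List<String> elementTexts, Config config) {
        if (config.groupByElement) {
            // One output document per matched element.
            return List.copyOf(elementTexts);
        }
        // Otherwise a single document, with element texts joined by the separator.
        return List.of(String.join(config.separator, elementTexts));
    }

    public static void main(String[] args) {
        List<String> texts = List.of("First paragraph.", "Second paragraph.");
        Config joined = Config.defaults().withSelector("article p");
        System.out.println(toDocuments(texts, joined).size());
        System.out.println(toDocuments(texts, joined.withGroupByElement(true)).size());
    }
}
```

Keeping the options in an immutable object means one reader configuration can be reused across many resources without one read affecting another.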

This new reader enhances Spring AI's ability to process web content and other HTML-based data sources.

Signed-off-by: Alexandros Pappas <[email protected]>

ilayaperumalg · 2025-03-10T11:41:02Z

@apappascs This is a nice addition and thank you for adding! Rebased and merged as 82b46d2

apappascs force-pushed the feature/jsoup-html-reader branch from 2879e6c to c0ef4ac February 14, 2025 15:55

ilayaperumalg self-assigned this Mar 10, 2025

ilayaperumalg added the document-reader label Mar 10, 2025

ilayaperumalg added this to the 1.0.0-M7 milestone Mar 10, 2025

ilayaperumalg closed this Mar 10, 2025
